(54) Method for summarizing a video using motion and color description



(57) A method extracts an intensity of motion activity
from shots in a compressed video. The method then uses
the intensity of motion activity to segment the video
into easy and difficult segments to summarize. Easy to
summarize segments are represented by any frames
selected from the easy to summarize segments, while
a color based summarization process generates
sequences of frames from each difficult to summarize
segment. The selected and generated frames of
each segment in each shot are combined to form the
summary of the compressed video.
Description

FIELD OF THE INVENTION 

[0001] This invention relates generally to videos, and
more particularly to summarizing a compressed video.

BACKGROUND OF THE INVENTION 

[0002] It is desired to automatically generate a summary
of a video, and more particularly, to generate the
summary from a compressed digital video.

Compressed Video Formats 

[0003] Basic standards for compressing a video as a
digital signal have been adopted by the Moving Picture
Experts Group (MPEG). The MPEG standards achieve
high data compression rates by developing information
for a full frame of the image only every so often. The full
image frames, i.e., intra-coded frames, are often referred
to as "I-frames" or "anchor frames," and contain full
frame information independent of any other frames.
Image difference frames, i.e., inter-coded frames, are
often referred to as "B-frames" and "P-frames," or as
"predictive frames," and are encoded between the I-frames
and reflect only image differences, i.e., residues, with
respect to the reference frame.

[0004] Typically, each frame of a video sequence is
partitioned into smaller blocks of picture element, i.e.,
pixel, data. Each block is subjected to a discrete cosine
transformation (DCT) function to convert the statistically
dependent spatial domain pixels into independent
frequency domain DCT coefficients. Respective 8x8 or
16x16 blocks of pixels, referred to as "macro-blocks,"
are subjected to the DCT function to provide the coded
signal.
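
The block transform described above can be illustrated with a short sketch. It is not taken from any MPEG reference implementation: the function name `dct_2d` and the orthonormal scaling are assumptions made here for illustration, and real codecs use fast integer approximations. The sketch shows that a uniform block concentrates all of its energy in the DC coefficient.

```python
import math

def dct_2d(block):
    """2-D DCT-II of a square block (orthonormal scaling, an assumption)."""
    n = len(block)

    def alpha(k):
        # Normalisation factor for the orthonormal DCT-II.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs

# A uniform 8x8 block: every pixel has the same value.
flat_block = [[128] * 8 for _ in range(8)]
flat_coeffs = dct_2d(flat_block)
dc_coefficient = flat_coeffs[0][0]  # carries the block average (scaled)
```

Because the DC coefficient is proportional to the block average, a "DC image" built from these coefficients is a low-resolution version of the frame.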

[0005] The DCT coefficients are usually energy
concentrated so that only a few of the coefficients in a
macro-block contain the main part of the picture information.
For example, if a macro-block contains an edge boundary
of an object, then the energy in that block, after
transformation, as represented by the DCT coefficients,
includes a relatively large DC coefficient and randomly
distributed AC coefficients throughout the matrix of
coefficients.

[0006] A non-edge macro-block, on the other hand, is
usually characterized by a similarly large DC coefficient
and a few adjacent AC coefficients which are substantially
larger than other coefficients associated with that
block. The DCT coefficients are typically subjected to
adaptive quantization, and then are run-length and
variable-length encoded. Thus, the macro-blocks of
transmitted data typically include fewer than an 8 x 8 matrix
of codewords.

[0007] The macro-blocks of inter-coded frame data, i.e.,
encoded P or B frame data, include DCT coefficients
which represent only the differences between predicted
pixels and the actual pixels in the macro-block.
Macro-blocks of intra-coded and inter-coded frame data also
include information such as the level of quantization
employed, a macro-block address or location indicator, and
a macro-block type. The latter information is often
referred to as "header" or "overhead" information.

[0008] Each P-frame is predicted from the lastmost
occurring I- or P-frame. Each B-frame is predicted from
an I- or P-frame between which it is disposed. The
predictive coding process involves generating displacement
vectors, often referred to as "motion vectors,"
which indicate the magnitude of the displacement to the
macro-block of an I-frame that most closely matches the
macro-block of the B- or P-frame currently being coded.
The pixel data of the matched block in the I-frame is
subtracted, on a pixel-by-pixel basis, from the block of
the P- or B-frame being encoded, to develop the
residues. The transformed residues and the vectors form
part of the encoded data for the P- and B-frames.
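
The residue computation described above can be sketched as follows. This is a simplified illustration, not an encoder: the function name `residue_block`, the (row, column) layout of the motion vector, and the tiny 2x2 block size used in the example are all assumptions chosen for readability.

```python
def residue_block(current, reference, top, left, motion_vector, size=8):
    """Subtract the motion-compensated reference block from the current block.

    motion_vector is (dy, dx): the displacement from the block position in
    the current frame to the matching block in the reference frame.
    """
    dy, dx = motion_vector
    residue = []
    for r in range(size):
        row = []
        for c in range(size):
            predicted = reference[top + dy + r][left + dx + c]
            row.append(current[top + r][left + c] - predicted)
        residue.append(row)
    return residue

# Reference frame with distinct pixel values.
reference = [[r * 4 + c for c in range(4)] for r in range(4)]
# Current frame: the same content shifted one pixel to the right.
current = [[reference[r][c - 1] if c > 0 else 0 for c in range(4)]
           for r in range(4)]

# The block at (1, 1) is found one pixel to the left in the reference.
matched = residue_block(current, reference, 1, 1, (0, -1), size=2)
unmatched = residue_block(current, reference, 1, 1, (0, 0), size=2)
```

With the correct motion vector the residue is all zeros, which is why the transformed residues plus the vectors are cheaper to encode than the raw pixels.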

Video Analysis 

[0009] Video analysis can be defined as processing a
video with the intention of understanding the content of
the video. The understanding of a video can range from
a "low-level" syntactic understanding to a "high-level"
semantic understanding.

[0010] The low-level understanding can be achieved
by analyzing low-level features, such as color, motion,
texture, shape, and the like. The low-level features can
be used to partition the video into "shots." Herein, a shot
is defined as a sequence of frames that begins when
the camera is turned on and lasts until the camera is
turned off. Typically, the sequence of frames in a shot
captures a single "scene." The low-level features can
be used to generate descriptors. The descriptors can
then be used to index the video, e.g., an index of each
shot in the video and perhaps its length.

[0011] A semantic understanding of the video is
concerned with the genre of the content, and not its
syntactic structure. For example, high-level features express
whether a video is an action video, a music video, a
"talking head" video, or the like.

Video Summarization

[0012] Video summarization can be defined as
generating a compact representation of a video that still
conveys the semantic essence of the video. The compact
representation can include "key" frames or "key"
segments, or a combination of key frames and segments.
As an example, a video summary of a tennis match can
include two frames, the first frame capturing both of the
players, and the second frame capturing the winner with
the trophy. A more detailed and longer summary could
further include all frames that capture the match point.
While it is certainly possible to generate such a summary
manually, this is tedious and costly. Automatic
summarization is therefore desired.

[0013] Automatic video summarization methods are
well known, see S. Pfeiffer et al. in "Abstracting Digital
Movies Automatically," J. Visual Comm. Image
Representation, vol. 7, no. 4, pp. 345-353, December 1996,
and Hanjalic et al. in "An Integrated Scheme for
Automated Video Abstraction Based on Unsupervised
Cluster-Validity Analysis," IEEE Trans. On Circuits and
Systems for Video Technology, Vol. 9, No. 8, December
1999.

[0014] Most known video summarization methods
focus exclusively on color-based summarization. Only
Pfeiffer et al. have used motion, in combination with
other features, to generate video summaries. However,
their approach merely uses a weighted combination that
overlooks possible correlation between the combined
features. Some summarization methods also use
motion features to extract key frames.

[0015] As shown in Figure 1, prior art video
summarization methods have mostly emphasized clustering
based on color features, because color features are
easy to extract and robust to noise. A typical method
takes a video A 101 as input, and applies a color based
summarization process 100 to produce a video summary
S(A) 102. The video summary consists of either a
single summary of the entire video, or a set of interesting
frames.

[0016] The method 100 typically includes the following
steps. First, cluster the frames of the video according
to color features. Second, arrange the clusters in an
easy to access hierarchical data structure. Third, extract
a key frame or a key sequence of frames from each of
the clusters to generate the summary.

Motion Activity Descriptor 

[0017] A video can also be intuitively perceived as
having various levels of activity or intensity of action. An
example of a relatively high level of activity is a scoring
opportunity in a sporting event video; on the other hand,
a news reader video has a relatively low level of activity.
The recently proposed MPEG-7 video standard
provides for a descriptor related to the motion activity in a
video.



SUMMARY OF THE INVENTION

[0018] It is an objective of the present invention to
provide an automatic video summarization method using
motion features, specifically motion activity features by
themselves and in conjunction with other low-level
features such as color and texture features.

[0019] The main intuition behind the present invention
is based on the following hypotheses. The motion
activity of a video is a good indication of the relative difficulty
of summarizing the video. The greater the amount of
motion, the more difficult it is to summarize the video. A
video summary can be quantitatively described by the
number of frames it contains, for example, the number
of key frames, or the number of frames of key segments.

[0020] The relative intensity of motion activity of a
video is strongly correlated to changes in color
characteristics. In other words, if the intensity of motion activity is
high, there is a high likelihood that change in color
characteristics is also high. If the change in color
characteristics is high, then a color feature based summary will
include a relatively large number of frames, and if the
change in color characteristics is low, then the summary
will contain fewer frames.

[0021] For example, a "talking head" video typically
has a low level of motion activity and very little change
in color as well. If the summarization is based on key
frames, then one key frame would suffice to summarize
the video. If key segments are used, then a one-second
sequence of frames would suffice to visually summarize
the video. On the other hand, a scoring opportunity in a
sporting event would have very high intensity of motion
activity and color change, and would thus take several
key frames or several seconds to summarize.

[0022] More particularly, the invention provides a
method that summarizes a video by first extracting
intensity of motion activity from a video. It then uses the
intensity of motion activity to segment the video into
easy and difficult segments to summarize.

[0023] Easy to summarize segments are represented
by a single frame, or selected frames anywhere in the
segment; any frame will do because there is very little
difference between the frames in the easy to summarize
segment. A color based summarization process is used
to summarize the hard segments. This process extracts
sequences of frames from each difficult to summarize
segment. The single frames and extracted sequences
of frames are combined to form the summary of the
video.

[0024] The combination can use temporal, spatial, or
semantic ordering. In a temporal arrangement, the
frames are concatenated in some temporal order, for
example first-to-last, or last-to-first. In a spatial
arrangement, miniatures of the frames are combined into a
mosaic or some array, for example, rectangular, so that a
single frame shows several miniatures of the selected
frames of the summary. A semantically ordered
summary might go from most exciting to least exciting.



BRIEF DESCRIPTION OF THE DRAWINGS 



[0025] 



Figure 1 is a block diagram of a prior art video
summarization method;

Figures 2 and 3 are graphs plotting motion activity
versus color changes for MPEG test videos;

Figure 4 is a flow diagram of a video summarization
method according to the invention;

Figure 5 is a flow diagram of a color based
summarization process according to the invention; and

Figure 6 is a block diagram of a summarization
method according to the invention.

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

[0026] Our invention summarizes a compressed video
using color and motion features. Therefore, our
summarization method first extracts features from the
compressed video.

Feature Extraction 

Color Features 

[0027] We can accurately and easily extract DC
coefficients of an I-frame using known techniques. For P-
and B-frames, the DC coefficients can be approximated
using motion vectors without full decompression, see,
for example, Yeo et al. "On the Extraction of DC
Sequence from MPEG video," IEEE ICIP Vol. 2, 1995. The
YUV value of the DC image can be transformed to a
different color space to extract the color features.

[0028] The most popularly used technique uses a color
histogram. Color histograms have been widely used in
image and video indexing and retrieval, see Smith et al.
in "Automated Image Retrieval Using Color and
Texture," IEEE Transaction on Pattern Analysis and
Machine Intelligence, November 1996. Typically, in a three
channel RGB color space, with four bins for each
channel, a total of 64 (4x4x4) bins are needed for the color
histogram.
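
As a rough illustration of the 4x4x4 binning, the following sketch builds a normalized 64-bin histogram from a list of RGB pixels. The function name `color_histogram_64`, the 8-bit value range, and the normalization to fractions are assumptions made here; the text only specifies four bins per channel.

```python
def color_histogram_64(pixels, bins_per_channel=4, max_value=256):
    """Normalized RGB histogram with bins_per_channel**3 bins.

    pixels is an iterable of (r, g, b) tuples with values in [0, max_value).
    """
    pixels = list(pixels)
    step = max_value // bins_per_channel  # width of one bin, 64 for 8-bit data
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Combine the three per-channel bin numbers into one index in 0..63.
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1.0
    total = len(pixels)
    if total:
        hist = [count / total for count in hist]
    return hist

# Three pixels: black, white, and pure red.
example_hist = color_histogram_64([(0, 0, 0), (255, 255, 255), (255, 0, 0)])
```

Normalizing by the pixel count makes histograms comparable across frames of different sizes, such as DC images taken at different resolutions.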

Motion Features

[0029] Motion information is mostly embedded in
motion vectors. Motion vectors can be extracted from P-
and B-frames. Because motion vectors are usually a
crude and sparse approximation to real optical flow, we
only use motion vectors qualitatively. Many different
methods to use motion vectors are known, see Tan et
al. "A new method for camera motion parameter
estimation," Proc. IEEE International Conference on Image
Processing, Vol. 2, pp. 722-726, 1995, Tan et al. "Rapid
estimation of camera motion from compressed video
with application to video annotation," to appear in IEEE
Trans. on Circuits and Systems for Video Technology,
1999, Kobla et al. "Detection of slow-motion replay
sequences for identifying sports videos," Proc. IEEE
Workshop on Multimedia Signal Processing, 1999, Kobla et
al. "Special effect edit detection using VideoTrails: a
comparison with existing techniques," Proc. SPIE
Conference on Storage and Retrieval for Image and Video
Databases VII, 1999, Kobla et al., "Compressed domain
video indexing techniques using DCT and motion vector
information in MPEG video," Proc. SPIE Conference on
Storage and Retrieval for Image and Video Databases
V, SPIE Vol. 3022, pp. 200-211, 1997, and Meng et al.
"CVEPS - a compressed video editing and parsing
system," Proc. ACM Multimedia 96, 1996.

[0030] As stated above, most prior art summarization
methods are based on clustering color features to obtain
color descriptors. While color descriptors are relatively
robust to noise, by definition, they do not include the
motion characteristics of the video. However, motion
descriptors tend to be less robust to noise, and therefore,
they have not been as widely used for summarizing
videos.

[0031] U.S. Patent Application Sn. 09/406,444 "Activity
Descriptor for Video Sequences," filed by Divakaran
et al. describes how motion features derived from
motion vectors in a compressed video can be used to
determine motion activity and the spatial distribution of the
motion activity in the video. Such descriptors are
successful for video browsing applications. Now, we apply
such motion descriptors to video summarization.

[0032] We hypothesize that the relative level of
activity in a video can be used to measure the
"summarizability" of the video. Unfortunately, there are no simple
objective measures to test this hypothesis. However,
because changes in motion often are accompanied by
changes in the color characteristics, we investigate the
relationship between the relative intensity of motion
activity and changes in color characteristics of a video.

Motion and Color Changes 

[0033] We do this by extracting the color and motion
features of videos from the MPEG-7 "test-set." We
extract the motion activity features from all the P-frames
by computing the average of motion vector magnitudes,
and a 64-bin RGB histogram from all the I-frames. We
then compute the change in the histogram from I-frame
to I-frame. We apply a median filter to the vector of
frame-to-frame color histogram changes to eliminate
changes that correspond to segment cuts or other
segment transitions. We plot the intensity of motion activity
versus the median filtered color change for every frame
as shown in Figures 2 and 3.
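
The measurements described above can be sketched in a few lines. The text does not specify the histogram distance or the median filter length, so the L1 distance and the window of three used here are assumptions, as are the function names.

```python
import math
from statistics import median

def motion_activity(motion_vectors):
    """Average magnitude of the (dx, dy) motion vectors of one P-frame."""
    if not motion_vectors:
        return 0.0
    return sum(math.hypot(dx, dy) for dx, dy in motion_vectors) / len(motion_vectors)

def histogram_change(hist_a, hist_b):
    """L1 distance between two histograms (the metric is an assumption)."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

def median_filter(values, window=3):
    """Sliding median; the window is truncated at both ends of the list."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

activity = motion_activity([(3, 4), (0, 0)])
change = histogram_change([0.5, 0.5], [1.0, 0.0])
# An isolated spike, such as a cut between segments, is suppressed.
filtered = median_filter([0.1, 0.1, 0.9, 0.1, 0.1])
```

A short median window removes single-sample spikes caused by cuts while leaving sustained color changes largely intact, which matches the stated purpose of the filter.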

[0034] Figures 2 and 3 respectively show the
relationship between intensity of motion activity and color
dissimilarity for the "jornaldanoite1" and "news1" test sets.
There is a clear correlation between the intensity of
motion activity and the change in color. For low activity, it
is very clear that the change in color is also low. For
higher activity levels, the correlation becomes less
evident as there are many possible sources of high activity,
some of which may not result in color content change.
However, when the activity is very low, it is more likely
that the content does not change frame-to-frame. We
use this information to pre-filter a video to detect
segments which are almost static, and hence, these static
segments can be summarized by a single key frame. Based
on these results we provide the following summarization
method.



Summarization Method 

[0035] Figure 4 shows a method 400 for summarizing
an input compressed video A 401 to produce a summary
S(A) 402.

[0036] The input compressed video 401 is partitioned
into "shots" using standard techniques well known in the
art, and as described above. By first partitioning the
video into shots, we ensure that each shot is homogeneous
and does not include a scene change. Thus, we will
properly summarize a video of, for example, ten
consecutive different "talking head" shots that at a semantic
level would otherwise appear identical. From this point on
the video is processed in a shot-by-shot manner.

[0037] Step 410 determines the relative intensity of
motion activity for each frame of each shot. Each frame
is classified into either a first or second class. The first
class includes frames that are relatively easy to
summarize, and the second class 412 includes frames that
are relatively difficult to summarize. In other words, our
classification is motion based.

[0038] Consecutive frames of each shot that have the
same classification are grouped into either an "easy" to
summarize segment 411, or a "difficult" to summarize
segment 412.
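
The classification and grouping steps can be sketched as below. The text does not say how the two classes are separated, so the single fixed `threshold` is an assumption, and `segment_shot` is a hypothetical name.

```python
from itertools import groupby

def segment_shot(activities, threshold):
    """Group consecutive frames of one shot into easy/difficult segments.

    activities holds one motion activity value per frame. Returns a list of
    (label, first_frame, last_frame) tuples with inclusive frame indices.
    """
    labels = ["easy" if a <= threshold else "difficult" for a in activities]
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length - 1))
        start += length
    return segments

shot_segments = segment_shot([0.1, 0.2, 3.0, 4.0, 0.3], threshold=1.0)
# shot_segments == [("easy", 0, 1), ("difficult", 2, 3), ("easy", 4, 4)]
```

Each "easy" tuple can then be summarized by any one of its frames, while each "difficult" tuple is handed to the color based process.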

[0039] For easy segments 411 of each shot, we
perform a simple summarization 420 of the segment by
selecting a key frame or a key sequence of frames 421
from the segment. The selected key frame or frames
421 can be any frame in the segment because all frames
in an easy segment are considered to be semantically
alike.

[0040] For difficult segments 412 of each shot, we
apply a color based summarization process 500 to
summarize the segment as a key sequence of frames 431.

[0041] The key frames 421 and 431 of each shot are
combined to form the summary of each shot, and the
shot summaries can be combined to form the final
summary S(A) 402 of the video.

[0042] The combination of the frames can use
temporal, spatial, or semantic ordering. In a temporal
arrangement, the frames are concatenated in some temporal
order, for example first-to-last, or last-to-first. In a spatial
arrangement, miniatures of the frames are combined
into a mosaic or some array, for example, rectangular, so
that a single frame shows several miniatures of the
selected frames of the summary. A semantic ordering
could be most-to-least exciting, or quiet-to-loud.
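
The temporal and spatial arrangements can be illustrated with frame indices alone. In this sketch the near-square grid is an assumption (the text only says "for example, rectangular"), and both function names are hypothetical.

```python
import math

def temporal_order(frame_indices, reverse=False):
    """First-to-last ordering, or last-to-first when reverse is True."""
    return sorted(frame_indices, reverse=reverse)

def mosaic_layout(frame_indices, columns=None):
    """Assign each miniature a (row, column) cell in a rectangular mosaic.

    Returns a list of (frame_index, row, column) tuples.
    """
    count = len(frame_indices)
    if count == 0:
        return []
    if columns is None:
        columns = math.ceil(math.sqrt(count))  # near-square grid (assumption)
    return [(frame, i // columns, i % columns)
            for i, frame in enumerate(frame_indices)]

selected = [40, 3, 17]
first_to_last = temporal_order(selected)
last_to_first = temporal_order(selected, reverse=True)
cells = mosaic_layout([3, 17, 40, 52, 60])
```

A semantic ordering would need a per-frame score, such as the motion activity, as the sort key instead of the frame index.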
[0043] Figure 5 shows the steps of a preferred color
based summarization process 500. Step 510 clusters
the frames of each difficult segment 412 according to
color features into clusters. Step 520 arranges the
clusters as a hierarchical data structure 521. Step 530
summarizes each cluster 511 of the difficult segment 412 by
extracting a sequence of frames from the cluster
to generate cluster summaries 531. Step 440 combines
the cluster summaries to form the key sequence of
frames 431 that summarize the difficult segment 412.
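
A much simplified sketch of the cluster, summarize, and combine steps follows. The text does not name a clustering algorithm, so a simple "leader" clustering on L1 histogram distance is swapped in here as an assumption; the hierarchical data structure 521 is omitted, and the function names and the `frames_per_cluster` parameter are hypothetical.

```python
def l1_distance(a, b):
    """L1 distance between two color histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster_by_color(histograms, max_distance):
    """Leader clustering: a frame joins the first cluster whose leader
    histogram is within max_distance, otherwise it starts a new cluster."""
    clusters = []  # list of (leader_histogram, member_frame_indices)
    for index, hist in enumerate(histograms):
        for leader, members in clusters:
            if l1_distance(leader, hist) <= max_distance:
                members.append(index)
                break
        else:
            clusters.append((hist, [index]))
    return [members for _, members in clusters]

def summarize_difficult_segment(histograms, max_distance, frames_per_cluster=2):
    """Take the first frames of each cluster and merge them in temporal order."""
    clusters = cluster_by_color(histograms, max_distance)
    cluster_summaries = [members[:frames_per_cluster] for members in clusters]
    return sorted(frame for summary in cluster_summaries for frame in summary)

# Two-bin histograms for five frames of one difficult segment.
segment_hists = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
segment_clusters = cluster_by_color(segment_hists, max_distance=0.5)
key_sequence = summarize_difficult_segment(segment_hists, max_distance=0.5)
```

Leader clustering is order dependent and was chosen only because it is short; any color clustering that yields per-cluster frame lists would fit the same summarize-then-combine structure.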
[0044] This method is especially effective with
news-video type sequences because the content of the video
primarily comprises low-action frames of "talking heads"
that can be summarized by key frames. The color-based
clustering process 500 needs to be carried out only
for sequences of frames that have higher levels of
action, and thus the overall computational burden is
reduced.

[0045] Figure 6 shows the summarization method 400
graphically. An input video 601 is partitioned 602 into
shots 603. Motion activity analysis 604 is applied to the
frames of the shots to determine easy (e) and difficult
(d) segments 605. Key frames, segments, or shots 606
extracted 607 from easy segments are combined with
color based summaries 608 derived from clustered color
analysis 609 to form the final summary 610.

[0046] In one application, the summary is produced
dynamically from the compressed video so that the
summary of the entire video is available to the viewer within
minutes of starting to "play" the video. Thus, the viewer
can use the dynamically produced summary to "browse"
the video.

[0047] Furthermore, based on the dynamically
produced summary, the user can request certain
portions to be resummarized on-the-fly. In other words, as
the video is played, the user summarizes selected
portions of the video to various levels of detail, using the
summaries themselves for the selection process,
perhaps using different summarization techniques for the
different portions. Thus, our invention provides a highly
interactive viewing modality that hitherto has not
been possible with prior art static summarization
techniques.

[0048] Although the invention has been described by
way of examples of preferred embodiments, it is to be
understood that various other adaptations and
modifications may be made within the spirit and scope of the
invention. Therefore, it is the object of the appended
claims to cover all such variations and modifications as
come within the true spirit and scope of the invention.




Claims 

1. A method for summarizing a compressed video
including motion and color features, comprising:

partitioning the compressed video into a plurality
of shots;

classifying each frame of each shot according
to the motion features, a first class frame having
relatively low motion activity and a second
class frame having relatively high motion activity;

grouping consecutive frames having the same
classification into segments;

selecting any one or more frames from each
segment having the first classification;

generating a sequence of frames from each
segment having the second classification using
the color features; and

combining the selected and generated frames
of each segment of each shot to form a summary
of the compressed video.

2. The method of claim 1 further comprising:

combining the selected and generated frames
in a temporal order.

3. The method of claim 1 further comprising:

combining the selected and generated frames
in a spatial order.

4. The method of claim 3 further comprising:

reducing the selected and generated frames in
size to form miniature frames.

5. The method of claim 1 further comprising:

combining the selected and generated frames
in a semantic order.

6. The method of claim 1 further comprising:

grouping the frames of each segment having
the second classification into clusters according
to the color features;

generating a cluster summary for each cluster;
and

combining the cluster summaries to form the
generated sequences of frames.

7. The method of claim 1 wherein the summary is
produced while playing the video.

8. The method of claim 1 wherein the summary is used
to resummarize the video.